49 CORESETS AND SKETCHES

Author

  • Jeff M. Phillips
Abstract

Geometric data summarization has become an essential tool both in geometric approximation algorithms and wherever geometry intersects with big data problems. In linear or near-linear time, large data sets can be compressed into a summary, and more intricate algorithms can then be run on the summary, with results that approximate those on the full data set. Coresets and sketches are the two most important classes of these summaries. A coreset is a reduced data set that can be used as a proxy for the full data set: the same algorithm can be run on the coreset as on the full data set, and the result on the coreset approximates the result on the full data set. It is often required or desired that the coreset be a subset of the original data set, but in some cases this is relaxed. A weighted coreset is one in which each point is assigned a weight, possibly different from the weight it had in the original set. A weak coreset associated with a set of queries is one where the error guarantee holds for a query that (nearly) optimizes some criterion, but not necessarily for all queries; a strong coreset provides error guarantees for all queries. A sketch is a compressed mapping of the full data set onto a data structure that is easy to update with new or changed data and that supports certain queries whose results approximate those on the full data set. A linear sketch is one where the mapping is a linear function of each data point, which makes it easy for data to be added, subtracted, or modified. These definitions can blend together, and some summaries can be classified as either or both. The overarching connection is that the summary size should ideally depend only on the approximation guarantee and not on the size of the original data set, although in some cases logarithmic dependence is acceptable.
We focus on five types of coresets and sketches: shape-fitting (Section 49.1), density estimation (Section 49.2), high-dimensional vectors (Section 49.3), high-dimensional point sets / matrices (Section 49.4), and clustering (Section 49.5). There are many other types of coresets and sketches (e.g., for graphs [AGM12] or Fourier transforms [IKP14]) that we do not cover, either for lack of space or because they are less geometric.
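
As an illustration of the linear-sketch definition in the abstract, here is a minimal Python sketch. It assumes a Count-Min style table over non-negative integer items; this is a standard construction chosen for illustration, not one the abstract names. The class name `CountMinSketch`, the helper `sketch_of`, and the `width`, `depth`, and `seed` parameters are all hypothetical. Because each update adds a value to fixed table cells, the table is a linear function of the item-frequency vector, so data can be added, subtracted, or merged.

```python
import random


class CountMinSketch:
    """Toy linear sketch of an item-frequency vector (illustrative only)."""

    PRIME = 2_147_483_647  # 2^31 - 1, modulus for the hash family

    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width, self.depth, self.seed = width, depth, seed
        # One (a, b) pair per row: h(x) = ((a*x + b) mod PRIME) mod width.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % self.PRIME) % self.width

    def update(self, item, delta=1):
        # A negative delta removes data; the update stays linear.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += delta

    def estimate(self, item):
        # Never underestimates when all true counts are non-negative.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )

    def merge(self, other):
        # Sketches built with the same hash functions add cell by cell.
        if (self.width, self.depth, self.seed) != (
            other.width, other.depth, other.seed
        ):
            raise ValueError("sketches must share width, depth, and seed")
        out = CountMinSketch(self.width, self.depth, self.seed)
        for row in range(self.depth):
            out.table[row] = [
                x + y for x, y in zip(self.table[row], other.table[row])
            ]
        return out


def sketch_of(stream, **kwargs):
    """Build a sketch from an iterable of non-negative integer items."""
    sketch = CountMinSketch(**kwargs)
    for item in stream:
        sketch.update(item)
    return sketch
```

Under these assumptions, `sketch_of(a).merge(sketch_of(b))` has the same table as `sketch_of(a + b)` for any two lists `a` and `b`, which is the linearity property in action. The table size depends only on `width` and `depth`, not on the length of the stream.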

Similar resources

One-Shot Coresets: The Case of k-Clustering

Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, were proposed. While these techniques provide provably good and small summaries, they are inherently problem dependent — the practitioner has to commit to a fixed clustering objective before even exploring the data....

Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...

Coresets for Nonparametric Estimation - the Case of DP-Means

Scalable training of Bayesian nonparametric models is a notoriously difficult challenge. We explore the use of coresets – a data summarization technique originating from computational geometry – for this task. Coresets are weighted subsets of the data such that models trained on these coresets are provably competitive with models trained on the full dataset. Coresets sublinear in the dataset si...

Publication date: 2016